Banking Dataset Customer Segmentation

Names: Yoshi Qi

1. Introduction and Dataset Research

Bank Marketing Data Set:

Data Set Information:

The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').

There are four datasets:

  1. bank-additional-full.csv — all examples (41,188) and 20 inputs, ordered by date (May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014].
  2. bank-additional.csv — 10% of the examples (4,119), randomly selected from 1), with 20 inputs.
  3. bank-full.csv — all examples and 17 inputs, ordered by date (an older version of this dataset with fewer inputs).
  4. bank.csv — 10% of the examples and 17 inputs, randomly selected from 3) (an older version of this dataset with fewer inputs).

The smaller datasets are provided for testing more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict whether the client will subscribe to a term deposit (variable y, yes/no).

Citation

[1] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

[2] S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimaraes, Portugal, October, 2011. EUROSIS. [bank.zip]

Task: Can you cluster the customers and show their clusters?

2. Data Cleaning and Data Manipulation

First of all, we separate the numeric variables from the dataframe and check whether these variables alone can yield meaningful results.
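Selecting the numeric columns can be done with pandas. A minimal sketch (the toy dataframe and its column names are illustrative stand-ins for the bank data):

```python
import pandas as pd

# Toy frame standing in for the bank dataframe; values are illustrative.
df = pd.DataFrame({
    "age": [30, 45, 52],
    "balance": [1200, -50, 3400],
    "job": ["admin.", "technician", "retired"],
    "y": ["no", "yes", "no"],
})

# Keep only the numeric attributes for the first round of analysis.
numeric_df = df.select_dtypes(include="number")
print(list(numeric_df.columns))  # ['age', 'balance']
```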

3. Basic Descriptive Analytics

Inspect the clusterability of the data with t-SNE.
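A minimal t-SNE sketch of the kind used here; the synthetic blobs stand in for the scaled numeric bank attributes, and the perplexity value is a judgment call (5–50 is the usual range):

```python
import numpy as np
from sklearn.manifold import TSNE

# Two synthetic blobs standing in for the scaled numeric attributes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(5, 1, (50, 5))])

# Project to 2D and plot the embedding to eyeball clustering structure.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)
```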

4. Dataset Scaling Decisions

Apparently the scales of the attributes differ vastly, hence the drastically different variances. Since the variances of the attributes are so different, we should apply standard scaling to the data.
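Standard scaling with scikit-learn, sketched on toy values where the columns (e.g. age vs. balance) sit on very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales (illustrative values).
X = np.array([[25.0, 1_000.0], [40.0, 50_000.0], [60.0, 200.0]])

X_scaled = StandardScaler().fit_transform(X)

# After scaling, each column has mean ~0 and unit variance.
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```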

5. Clusterability and Clustering Structure Questions

The Hopkins statistic is close to 0, indicating that the data is highly clusterable.
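For reference, a minimal Hopkins-statistic sketch in the convention used above (values near 0 indicate clusterable data); the function name and sampling fraction are illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, sample_frac=0.1, seed=0):
    """H = sum(w) / (sum(u) + sum(w)): w are nearest-neighbour distances of
    sampled real points, u of uniform points over the data's bounding box.
    Clustered data makes w small relative to u, driving H toward 0."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = max(1, int(sample_frac * n))
    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # Nearest-neighbour distance of sampled real points (skip the point itself).
    idx = rng.choice(n, m, replace=False)
    w = nn.kneighbors(X[idx], n_neighbors=2)[0][:, 1]

    # Nearest-neighbour distance of uniform points over the bounding box.
    U = rng.uniform(X.min(axis=0), X.max(axis=0), (m, d))
    u = nn.kneighbors(U, n_neighbors=1)[0][:, 0]
    return w.sum() / (u.sum() + w.sum())

# Two tight blobs: clearly clusterable, so H should be close to 0.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (100, 2)), rng.normal(5, 0.1, (100, 2))])
print(hopkins(X))
```

Note the other common convention flips the ratio so that values near 1 indicate clusterability; which one a library uses is worth checking.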

The t-SNE plot suggests that there are 8 to 12 clusters in the dataset, which are convex but not balanced in size. However, the clusters do not separate well in general and appear to overlap.

6. Algorithm Selection

6.1 DBSCAN using only numerical attributes

We run DBSCAN using only the numerical attributes. Since the dataset appears to have a convex but size-imbalanced clustering structure, DBSCAN could be a good way to evaluate it.

6.2 Use both categorical and numerical attributes

  1. Evaluate with Gower’s distance.
  2. Use the categorical attributes as well as the numerical attributes with K-prototypes and hierarchical clustering.
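Gower's distance handles mixed types by range-normalising numeric differences and counting categorical mismatches, then averaging over columns. A minimal sketch (the function and toy frame are illustrative; in practice a dedicated package can compute this):

```python
import numpy as np
import pandas as pd

def gower_distance(df):
    """Minimal Gower distance: range-normalised |diff| for numeric columns,
    0/1 mismatch for categorical columns, averaged over all columns."""
    n = len(df)
    D = np.zeros((n, n))
    for col in df.columns:
        vals = df[col].to_numpy()
        if pd.api.types.is_numeric_dtype(df[col]):
            span = vals.max() - vals.min()
            part = np.abs(vals[:, None] - vals[None, :]) / (span if span else 1)
        else:
            part = (vals[:, None] != vals[None, :]).astype(float)
        D += part
    return D / df.shape[1]

toy = pd.DataFrame({"age": [20, 40, 60], "job": ["admin.", "admin.", "retired"]})
D = gower_distance(toy)
print(D.round(2))
```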

7. DBSCAN with only the numerical attributes

7.1. Parameter Selection

Reference for the previous analysis.

The analysis above demonstrates how to systematically locate the elbow point in order to reduce the computational burden and narrow down the parameter range.
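The standard elbow heuristic for DBSCAN is the sorted k-distance curve: for a candidate minPts = k, plot each point's distance to its k-th nearest neighbour in ascending order and read eps off the knee. A sketch on synthetic data (k here is illustrative, not the report's 108):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the scaled numeric attributes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (200, 3)), rng.normal(4, 0.3, (200, 3))])

k = 10  # candidate minPts (illustrative)
# n_neighbors=k+1 because each point's nearest neighbour is itself;
# column -1 is then the distance to the true k-th nearest neighbour.
k_dist = np.sort(NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)[0][:, -1])
# Plot k_dist and look for the knee: a good eps sits just below it.
print(k_dist[:3], k_dist[-3:])
```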

From the analysis above, we see that setting minPts=108 gives the highest Calinski-Harabasz score. The number of clusters and the number of noise points also seem reasonable, so we fine-tune the parameters from there.

From the fine-tuning section, we see that for minPts=108, $\epsilon = 1.15$ gives the highest Calinski-Harabasz score and a relatively high average silhouette score, but also a relatively high number of noise points. $\epsilon = 1.22$ gives the highest average silhouette score but a lower Calinski-Harabasz score; however, it produces far fewer noise points than $\epsilon = 1.15$. We therefore pick $\epsilon = 1.22$ with minPts=108.
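Running DBSCAN with chosen parameters then looks like the following sketch; the toy data and the scaled-down eps/min_samples are illustrative (the report's values are eps=1.22, min_samples=108 on the full scaled data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight blobs plus a few scattered points standing in for noise.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (100, 2)),
               rng.normal(5, 0.2, (100, 2)),
               rng.uniform(-2, 7, (10, 2))])

labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
n_noise = int((labels == -1).sum())
print(n_clusters, n_noise)
```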

7.2. Clustering Algorithm

7.3. Clustering Algorithm Results Presentation

7.4. Assessing Clustering Separation and Cohesion

From the silhouette plot we can see that clusters 2 and 6 have particularly poor cohesion and separation, with most of their members falling below the average silhouette score. Meanwhile, the other major clusters seem fine in terms of cohesion and separation. Overall the clustering is acceptable.
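The per-cluster values behind a silhouette plot come from `silhouette_samples`. A minimal sketch on toy labelled data (the blobs and labels are illustrative):

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two well-separated toy clusters with known labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)

# Per-point silhouette values; a silhouette plot groups and sorts these
# per cluster, so low-cohesion clusters show up as short, ragged bars.
s = silhouette_samples(X, labels)
for c in np.unique(labels):
    print(c, s[labels == c].mean().round(2))
```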

7.5. Additional Analysis

The similarity matrix shows that the clusters are somewhat close to one another.

7.6. Describing Each of the Clusters

8. Clustering Using Mixed Features: K-prototypes

8.1. Parameter Selection

The t-SNE plot shows a clear clustering structure, and the suggested number of clusters is around 6.

The elbow plot of cost values does not support a clear clustering structure, which contradicts the t-SNE plot. We therefore go with the t-SNE plot, which shows a clear structure, and check whether the results align with it.

From the t-SNE plot and the average silhouette score plot, we can see that k=4 both fits the t-SNE plot better and has a relatively high average silhouette score.
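K-prototypes (typically run via the kmodes package's `KPrototypes` class) combines squared Euclidean distance on numeric features with a gamma-weighted mismatch count on categorical features. A minimal numpy sketch of just the assignment step, with illustrative names and toy data:

```python
import numpy as np

def assign(X_num, X_cat, centroids_num, centroids_cat, gamma=1.0):
    """K-prototypes assignment step: squared Euclidean distance on numeric
    columns plus gamma * categorical mismatch count; pick the nearest prototype."""
    num_part = ((X_num[:, None, :] - centroids_num[None, :, :]) ** 2).sum(axis=2)
    cat_part = (X_cat[:, None, :] != centroids_cat[None, :, :]).sum(axis=2)
    return np.argmin(num_part + gamma * cat_part, axis=1)

# Four toy points with one numeric and one categorical feature.
X_num = np.array([[0.1], [0.2], [5.0], [5.1]])
X_cat = np.array([["a"], ["a"], ["b"], ["b"]])
c_num = np.array([[0.0], [5.0]])
c_cat = np.array([["a"], ["b"]])
print(assign(X_num, X_cat, c_num, c_cat))  # [0 0 1 1]
```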

8.2 Clustering Algorithm

8.3 Clustering Algorithm Result Presentation

8.4. Assessing Clustering Separation and Cohesion

9. Clustering Using Mixed Features: Hierarchical Clustering

9.1 Parameter Selection

It seems that the most appropriate choice is average linkage with k=5.
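With a precomputed Gower distance matrix, average-linkage clustering can be run via scipy. A sketch on a toy distance matrix (k=2 here on the toy data; the analysis above chose k=5 on the real matrix):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy symmetric distance matrix standing in for the Gower distances:
# points {0, 1} are close, points {2, 3} are close, the groups are far apart.
D = np.array([[0.0, 0.1, 0.9, 0.8],
              [0.1, 0.0, 0.85, 0.9],
              [0.9, 0.85, 0.0, 0.15],
              [0.8, 0.9, 0.15, 0.0]])

# linkage expects a condensed distance vector; squareform converts it.
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree at k clusters
print(labels)
```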

9.2 Clustering Algorithm

9.3 Clustering Algorithm Result Presentation

9.4. Assessing Clustering Separation and Cohesion

10. Analysis Summary and Conclusion

10.1. Algorithm Comparison Summary

Since we are comparing algorithms run on different subsets of the dataset (DBSCAN on numerical features, K-prototypes and hierarchical clustering on mixed features), it is not entirely accurate to compare average silhouette scores directly. Roughly speaking, though, the average silhouette scores are not too different.

DBSCAN

Judging from the results, DBSCAN suggests 8 clusters, but a fair proportion of the data points are classified as noise.

K-prototypes

K-prototypes suggests 4 clusters.

Hierarchical Clustering

Hierarchical clustering suggests average linkage with 5 clusters.

10.2. Conclusion and Insights Summary

Considering the huge size of the data, DBSCAN is not only more efficient and easier to interpret, but it also performed better at identifying not just the major cluster but also the smaller clusters, which in business often generate the most revenue. Its only drawback is the fair proportion of outliers excluded from the clustering result. K-prototypes and hierarchical clustering both used Gower's distance and therefore require much more computational power and time. Since all three models proposed different numbers of clusters, we can also combine business insight and field experience to select the appropriate model based on the results.